Abstract

We provide algorithms for efficiently addressing quantum memory in parallel. 
These imply that the standard circuit model can be simulated with low overhead 
by the more realistic model of a distributed quantum computer. As a result, the 
circuit model can be used by algorithm designers without worrying whether the 
underlying architecture supports the connectivity of the circuit. In addition, we 
apply our results to existing memory intensive quantum algorithms. We present 
a parallel quantum search algorithm and improve the time-space trade-off for the 
Element Distinctness and Collision problems. 



1 Introduction 

There is a significant gap between the usual theoretical formulation of quantum algo- 
rithms and the way that quantum computers are likely to be implemented. Descriptions 
of quantum algorithms are often given in one of two theoretical models: the quantum 
Random Access Memory (RAM) model (possibly equipped with an oracle) or the circuit 
model, both of which essentially ignore locality issues. On the other hand, any imple- 
mentation is likely to be mostly local in two or three dimensions, with a small number 
of long-range connections and, due to the presumed requirements of fault-tolerance [1],
to involve concurrent execution of one- and two-qubit gates with fast classical control.
This is unlike classical computers, whose implementations are currently dominated by
von Neumann architectures where $O(1)$ cores share access to a large RAM.

We address problems with both theoretical models of quantum computing, demonstrat- 
ing that a single idea — reversible sorting networks — can be used to efficiently relate them 
to a model of computation that is more physically realistic. For example, we show how 
quantum computers can efficiently access memory in a parallel and unrestricted way. 
In fact, the memory does not need to be in a single place, but could be distributed 
amongst many processors. These ideas are naturally extended to relate any quantum 
algorithm presented in terms of a quantum circuit to a distributed quantum computer in 
which each processor acts on a few well-functioning qubits connected to a small number 
of other memory sites (possibly via long-range interactions). Hence experiments such 
as NV centres in diamond, and trapped ions connected using optical cavities [12, 24], 
or cavity QED for superconducting qubit networks [6, 28], could be used to efficiently 
implement quantum algorithms presented in the circuit model. 

Our results can be summarized as relating the following three models of quantum com- 
putation presented pictorially in Fig. 1: the well known circuit model (where a single 
processor can perform up to N concurrent operations on any of the qubits), quantum 
parallel RAM (where N processors can each perform one operation per time step on any 
part of the memory), and a more physically realistic model, distributed quantum comput- 
ing (where processors with local memory are laid out in a certain fixed topology). 

We provide an algorithm (circuit) that can look up multiple memory entries in parallel, 
and prove that this algorithm (measured in terms of circuit complexity) is scarcely more 
expensive than any circuit capable of accurately accessing even a single entry. These 
results are explained in more detail in §2 and §3 and culminate in the following two 
Theorems. 

Theorem 1. A quantum circuit which accesses one of N memory bits given its index
requires width $\Omega(N)$ and depth $\Omega(\log N)$.

Theorem 2. There is a uniform family of quantum circuits computing, from N indices
$j_1, \ldots, j_N$ and N bits $x_1, \ldots, x_N$, the N bits $x_{j_1}, \ldots, x_{j_N}$. This circuit family has width
$O(N \log N)$ and depth $O(\log N \log\log N)$.

The word "compute" in Theorem 2 refers to replacing some target registers $y_1, \ldots, y_N$
with $y_1 \oplus x_{j_1}, \ldots, y_N \oplus x_{j_N}$. The input $y_i$ can be interpreted as the register state of
processor $i$ and the computation $y_i \oplus x_{j_i}$ as a memory request by this processor. Therefore,
Theorem 2 allows N processors unrestricted access to a shared memory, demonstrating 
the equivalence (up to log factors) between the quantum parallel RAM and the circuit 
models. 

We also provide an algorithm for efficiently moving data (Theorem 3). Distinct indices
are simply permuted, replacing $x_1, \ldots, x_N$ with $x_{j_1}, \ldots, x_{j_N}$. Armed with Theorem 3,
we relate the circuit model to the distributed quantum computing model in §4. We 
consider a particular topology that is based on an efficient sorting network, and show 
that it leads to a physically realistic distributed quantum computer that has a very small 




Figure 1: Our results can be summarized as relating the following models of quantum
computation: (a) The circuit model - a single processor can access all of the memory and
perform up to N operations concurrently; (b) Quantum parallel RAM - N processors
with unrestricted access to the memory, each processor can only perform one operation
per time step; (c) Distributed quantum computing - N single operation processors
with restricted access to the memory and (d) a redrawing of the picture (c) where we
emphasize the locality of the model. In the illustration, we depict the case N = 4.

overhead for simulating arbitrary quantum circuits. 

In §4.2, we take the concept of emulating arbitrary circuits on a restricted, non-monolithic 
quantum computer one step further. We prove that given (by an experimentalist) any 
layout of qubits grouped into memory sites, a device with this layout can implement 
algorithms presented in the circuit or quantum RAM models. Of course there is a price 
to pay: the overhead depends on the topology of the processors, but our algorithm is 
close to optimal. More precisely, we prove the following Theorem. 

Theorem 5. Let $G$ be a graph on N vertices, and let $D_G$ denote the minimum depth
of a sorting algorithm over $G$. Arbitrary quantum circuits of width N can be emulated
on $G$ with an overhead factor of $O(D_G \log N)$. On the other hand, any algorithm for
emulating arbitrary circuits must cost at least $\Omega(D_G / \log N)$.
Thus, the quantum parallel RAM model can be efficiently simulated using a distributed
quantum computer. We consider, in §5, various quantum search problems in the quantum
parallel RAM model. These include the element distinctness problem, which in
the past has been considered in the quantum RAM model [4, 10] as well as in a parallel
model with no communication [17]. In both of these models, the best time-space
trade-off that has been achieved is $ST^2 = O(N^2)$, where S represents memory space, T
represents time, and N is the number of elements to test for distinctness. These elements
are given by a subroutine or circuit which computes them from their indices. Grover and 
Rudolph [17] pose beating this trade-off as a challenge which we answer in the following 
theorem. 

Theorem 7. There is a quantum algorithm solving the element distinctness problem 
that has a time-space trade-off 



2 The cost of accessing quantum memory 

In this Section, we motivate and introduce the primitives of parallel memory look-ups 
and data-moving. We describe methods of measuring the cost of an algorithm, and give 
the precise statement of Theorems 1, 2 and 3. 

The costs of our algorithms are all specified in terms of the circuit model. All circuits
developed in this Section and in the proofs will be comprised of reversible (unitary) 
gates that map computational basis vectors to computational basis vectors. In this 
sense, our algorithms in Sections 2, 3 and 4 are entirely classically reversible, nowhere 
introducing nor exploiting quantum superposition. Moreover, all our circuits perfectly 
clean any ancilla qubits that they use, so that these circuits can be used as quantum 
subroutines. 

Requirements will be shown in terms of logical unitaries acting on input registers, without
always explicitly showing auxiliary registers, though such ancillas will naturally be
required. These are always presumed to be provided in the canonical pure computational
basis state $|0\rangle$, and will always be returned to that state on algorithm termination,
independent of any input. In §3, we will prove the Theorems by providing circuits that
implement the required logical unitaries.

A circuit is decomposed (perhaps recursively) into a series of subroutines. In the circuit 
model, every subroutine is built out of gates from a universal gate set (cf. for example, 
reference [21]). Gates can be implemented concurrently (that is, within the same
timeslice) whenever they act on disjoint sets of qubits. The cost of a circuit is measured by
three parameters: 

• depth, the number of timeslices required in the circuit; 

• size, the total number of gates in the circuit, and 
• width, the total number of qubits (inputs + ancillas) required in the circuit.

These three are all taken to be functions of the logical unitary width, which is the number
of input qubits required for the logical unitary. 

2.1 The cost of a single look-up 

Often when quantum algorithms are quoted in the query model, the concept of an oracle
is used to abstract away those logical unitaries that are intended to make 'random' access 
to (quantum) memory. In this Section, we examine the idea of accessing memory in more 
detail, without using oracles. 

Definition 1. A logical unitary for accessing a single piece of data is a map $U_{(1,N)}$ that
implements

$$ U_{(1,N)} : |j\rangle |y\rangle |x_1, x_2, \ldots, x_N\rangle \mapsto |j\rangle |y \oplus x_j\rangle |x_1, x_2, \ldots, x_N\rangle, \tag{2} $$

where $\oplus$ denotes bitwise addition.

Here we have depicted 2 + N registers. The first register (index register) holds an index
large enough to 'point' to any one of N data registers; its associated Hilbert space must 
be of dimension at least N. The second register is called the target register and holds 
the same kind of data as a data register. The other N registers are data registers and 
could in principle be of any (equal) size. 

We can derive a simple lower bound for the cost of accessing a single piece of memory 
based on two simple constraints on any circuit for $U_{(1,N)}$. The circuit must hold the
entire database and there must be a causal chain from every data register to the target 
register. More precisely, we have the following Theorem. 

Theorem 1. If a circuit implements $U_{(1,N)}$ on N data registers each consisting of d
qubits, then its width is $\Omega(Nd)$ and its depth is $\Omega(\log N)$.

Proof: The width of the circuit must be $\Omega(Nd)$ because this is the logical width of
the unitary (the number of input/output qubits). Since any data register could affect
the target register, there must be a causal chain of gates from any data register to the
target register. Each permitted gate touches $O(1)$ registers, so the depth of the longest
chain must be $\Omega(\log N)$. □
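To make Definition 1 and the depth argument concrete on computational basis states, here is a minimal Python sketch. The names `single_lookup` and `min_causal_depth` are illustrative and do not come from the paper. Registers are modelled as non-negative integers, and the depth helper assumes every gate touches at most `fan_in` registers, which is one particular instance of the $O(1)$ bound used in the proof.

```python
def single_lookup(j, y, data):
    """Apply U_(1,N) to the basis state |j>|y>|x_1..x_N> (1-based index j).

    The bitwise addition of Definition 1 is modelled by integer XOR.
    """
    return j, y ^ data[j - 1], list(data)


def min_causal_depth(n_registers, fan_in=2):
    """Smallest depth at which n_registers inputs can all influence one output,
    assuming (hypothetically) that each gate touches at most fan_in registers."""
    depth, reach = 0, 1
    while reach < n_registers:
        reach *= fan_in   # each extra layer multiplies the reachable set
        depth += 1
    return depth
```

Applying `single_lookup` twice restores the target register, reflecting that the map is its own inverse, and `min_causal_depth` grows logarithmically in the number of data registers.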

There is a sense in which the gates in a typical circuit for $U_{(1,N)}$ can be said to be "not
working very hard" (although this idea is hard to quantify precisely), and this inefficiency 
points to the need for a parallel algorithm. 
2.2 The cost of parallel look-ups

Definition 2. A logical unitary for accessing N pieces of data in parallel is a map $U_{(N,N)}$ that
implements

$$ U_{(N,N)} : |j_1, \ldots, j_N\rangle |y_1, \ldots, y_N\rangle |x_1, \ldots, x_N\rangle \mapsto |j_1, \ldots, j_N\rangle |y_1 \oplus x_{j_1}, \ldots, y_N \oplus x_{j_N}\rangle |x_1, \ldots, x_N\rangle. \tag{3} $$




Here we have depicted N index registers, N target registers, and N data registers,
comprising a total of 3N input registers. As before, the index registers are each made
up of $\lceil \log_2 N \rceil$ qubits, while the target registers and data registers are each made up of
d qubits.

Theorem 2. For data registers of d qubits each, there exists a uniform family of circuits
implementing $U_{(N,N)}$, having width $O(N(\log N + d))$ and depth $O(\log N \log(d \log N))$.

Proved in §3.4. □

Note that the width of the circuit for $U_{(N,N)}$ is linear in the width of the logical unitary
itself, which is the best that could be hoped for. Theorems 1 and 2 tell us that for parallel 
memory lookups, we can achieve a factor N more 'effect' for only a small additional 
'effort'. This will be seen to have radical effects on certain 'memory-intensive' algorithms 
in §5. 
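On basis states the parallel look-up is just N simultaneous XOR updates, which the following minimal sketch illustrates. The function name `parallel_lookup` and the integer encoding of registers are assumptions made for illustration. The sketch shows only the logical action, not the low-depth circuit of Theorem 2.

```python
def parallel_lookup(indices, targets, data):
    """Apply U_(N,N) to |j_1..j_N>|y_1..y_N>|x_1..x_N> (1-based indices).

    Several processors may request the same data register; each target
    y_i is replaced by y_i XOR x_{j_i}, and indices and data are unchanged.
    """
    new_targets = [y ^ data[j - 1] for j, y in zip(indices, targets)]
    return list(indices), new_targets, list(data)
```

Because XOR is an involution, calling the function a second time on its own output returns the original targets.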

2.3 Data-moving

Two qubits can easily be moved using a SWAP gate (for example), but then the locations
of the qubits to be exchanged must be known at the time the circuit is created ('compile
time'). Moreover, in architectures such as distributed quantum computing (which we
will discuss in more detail in §4), it is not possible to implement a SWAP gate between
arbitrary qubits, due to the locality constraints. We could use two $U_{(N,N)}$ circuits to
perform an arbitrary permutation of the indices, mapping the state $|x_1, x_2, \ldots, x_N\rangle$ to
$|x_{j_1}, x_{j_2}, \ldots, x_{j_N}\rangle$. However, by considering the permutation problem directly, we present
a circuit that is simpler and cheaper (by a factor 2) than using $U_{(N,N)}$ twice.

The No-Cloning Theorem [29] prevents us from duplicating quantum data when it is not
in a known basis, and so a data-moving algorithm need only be concerned with permuting
a set of qubits, without trying to copy any. For this reason, the logical unitary for moving 
data is only defined on states whereby, in every branch of the quantum superposition, 
the index registers collectively hold distinct pointers. 

Definition 3. A logical unitary for data moving is a map $V_N$ acting on quantum registers
that implements

$$ V_N : |j_1, j_2, \ldots, j_N\rangle |x_1, x_2, \ldots, x_N\rangle \mapsto |j_1, j_2, \ldots, j_N\rangle |x_{j_1}, x_{j_2}, \ldots, x_{j_N}\rangle \tag{4} $$

whenever the $j_i$ are all distinct.

As before, the index registers are each taken to be of $\lceil \log_2 N \rceil$ qubits, while the N
data registers are each of d qubits. (It is not important to us how $V_N$ behaves on input
states violating the given guarantee, in the same way that we don't care what a circuit
does when its ancilla qubits are input in a state other than $|0\rangle$.)

In §3.2, we present quantum circuits for $V_N$ that are closely related to the circuits for
$U_{(N,N)}$. (In fact, it is simpler to describe our solution for $V_N$ first.)

Theorem 3. For data registers of d qubits each, there exists a uniform family of circuits
implementing $V_N$ deterministically, having width $O(N(\log N + d))$ and depth
$O(\log N \log(d \log N))$.

Proved in §3.4. □
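The logical action of Definition 3 on basis states can be written in a few lines. This is a sketch under the stated guarantee that the indices are distinct. The name `data_move` is hypothetical, and raising an error on invalid input is a choice made here, since the paper leaves the behaviour outside the guarantee unspecified.

```python
def data_move(indices, data):
    """Apply V_N to |j_1..j_N>|x_1..x_N>: output register i holds x_{j_i}.

    Only defined when the 1-based indices are all distinct, i.e. when
    they form a permutation of 1..N, so no data item is ever copied.
    """
    if sorted(indices) != list(range(1, len(data) + 1)):
        raise ValueError("indices must be distinct values in 1..N")
    return list(indices), [data[j - 1] for j in indices]
```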



3 Algorithm Descriptions 

In this Section, we describe and analyse algorithms (circuits) for accessing memory in 
parallel, $U_{(N,N)}$, and data moving, $V_N$, and prove the corresponding Theorems 2 and 3.
These algorithms share the same overall structure. A sorting network is used to sort 
the data whilst saving some auxiliary bits. Next, we apply a transformation to the 
sorted data; either a permutation, in the case of data moving, or a general copying 
transformation, for parallel look-ups. Finally we pass the transformed data through the 
sorting network in reverse, using the auxiliary bits to ensure that the sorting is correctly 
reversed. 

We begin this Section by describing the main subroutine for the algorithms; a classical 
reversible sorting network. Then we present the conceptually simpler algorithm for data 
moving, before going on to describe our parallel look-up algorithm. Both the parallel 
look-up algorithm and the data-moving algorithm are in fact classical reversible circuits. 
Indeed, the tools we use (principally sorting networks) were developed originally for 
classical parallel computing [26, 27, 19]. We were unable to find our Theorems 2 or 3 in 
the literature (although [26] is somewhat similar), and we suspect that may be because 
quantum computing presents architectural issues that are different to those encountered 
in the field of traditional classical parallel computing. 

3.1 Sorting networks 

A sorting network is a network on T wires in which each wire represents one of T 
elements and where the only gates are binary comparators. A binary comparator takes 
two elements on its input, and it outputs the same two elements but in the correct order, 
according to some specified comparison routine. To make it reversible, each comparator
additionally has its own sorting-ancilla bit that is to enter the comparator in state $|0\rangle$.
This sorting-ancilla gets flipped to a $|1\rangle$ if and only if the comparator exchanges the
order of the two elements input. In more detail, the comparator first compares the
two elements, storing the resulting bit in the sorting-ancilla; then conditioned on the
sorting-ancilla being set, it swaps over the two elements. These two steps are each
clearly reversible in their own right. The sorting-ancillas are then all retained, to enable
'unsorting' later. A reversible comparator for full lexicographic sort on b-bit objects was
constructed by Thapliyal et al. [25]. Their algorithm is based on the binary tree and is
efficient for our purposes, having width $O(b)$ and depth $O(\log b)$.
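The two-step structure of the reversible comparator (compare into the ancilla, then swap conditioned on the ancilla) can be sketched classically as follows. This is an illustration on Python values rather than the bit-level construction of [25], and the function names are assumptions.

```python
def comparator(a, b, ancilla=0):
    """Reversible comparator on a basis state; the ancilla is expected to enter as 0.

    Step 1 writes the comparison result into the sorting-ancilla.
    Step 2 swaps the two elements conditioned on that ancilla.
    """
    ancilla ^= int(a > b)      # step 1: compare, store the bit
    if ancilla:                # step 2: controlled swap
        a, b = b, a
    return a, b, ancilla


def comparator_inverse(a, b, ancilla):
    """Undo `comparator`: controlled swap first, then uncompute the comparison."""
    if ancilla:
        a, b = b, a
    ancilla ^= int(a > b)      # the ancilla returns to 0
    return a, b, ancilla
```

Note that equal elements are never swapped, so the ancilla stays 0 in both directions and the inverse remains consistent.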

All of the comparators in any given sorting network should be identical, and when any T 
elements are input to the sorting network, its overall effect should be to output them in 
totally sorted order, with certainty. The network can of course be designed completely 
independently of the comparator, since the details of what makes one element 'greater', 
'less', or 'equal' to another is irrelevant from the perspective of which binary comparisons 
are needed to guarantee sorting. Thus for any value of T, one can ask for the lowest 
depth sorting network for sorting T elements. In references [2, 22], it is shown that T
elements can be sorted in $O(\log T)$ depth of comparators, and that a uniform family of
sorting networks achieves this, as needed for our Theorems 2 and 3. Unfortunately the
constant for Paterson's simplified version of the AKS sorting network [22] is around 6,100
and so not practical for any realistic sizes of T. However, the bitonic sorting network
[18] sorts $T = 2^t$ elements in depth $1 + 2 + \cdots + t = t(t+1)/2 = O(\log^2 T)$. The case
T = 8 is illustrated in Fig. 2.







Figure 2: The bitonic sorting network for 8 elements. Each gate in the network represents 
a comparator between two elements. (The auxiliary registers required to make the overall 
circuit reversible have been suppressed for clarity.) 
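The layered structure of the bitonic network can be generated programmatically. The sketch below uses one common formulation in which every comparator places the smaller element on the lower-numbered wire; it may differ from the drawing in Fig. 2 in how wires are paired. The function names are illustrative.

```python
def bitonic_layers(t):
    """Comparator layers of a bitonic sorting network on T = 2**t wires.

    Each comparator (lo, hi) puts the smaller element on wire lo.
    Merge stage s contributes s layers, giving depth t(t+1)/2 in total.
    """
    T = 1 << t
    layers = []
    for stage in range(1, t + 1):
        mask = (1 << stage) - 1
        # First layer of the stage: pair each wire with its mirror image in the block.
        layers.append([(i, i ^ mask) for i in range(T) if i < (i ^ mask)])
        # Remaining layers: half-cleaners at decreasing distances.
        for sub in range(stage - 2, -1, -1):
            step = 1 << sub
            layers.append([(i, i + step) for i in range(T) if not (i & step)])
    return layers


def apply_network(layers, values):
    """Run the comparators layer by layer on a list of values."""
    values = list(values)
    for layer in layers:
        for lo, hi in layer:
            if values[lo] > values[hi]:
                values[lo], values[hi] = values[hi], values[lo]
    return values
```

For T = 8 this yields six layers. If one takes the AKS depth to be exactly $6100 \log_2 T$ (an assumption based on the constant quoted above), then $t(t+1)/2 < 6100\,t$ for every $t < 12199$, so the bitonic network is the shallower of the two for any practical T.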

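Write $s(T)$ for the total number of comparators appearing in a given sorting network for
T elements. A sorting operation is written as follows:

$$ S_T : |0\rangle \otimes \bigotimes_{k=1}^{T} |x_k\rangle \mapsto |\sigma\rangle \otimes \bigotimes_{k=1}^{T} |x_{\sigma(k)}\rangle. \tag{5} $$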

Here $\sigma \in \mathrm{Sym}(T)$ denotes a permutation that puts the T elements in order, and the first
register (storing $\sigma$) actually consists of the $s(T)$ sorting-ancilla bits that get fed into the
comparators. It is clear from this observation that $s(T) \geq \log_2(T!) = \Omega(T \log T)$ for
any valid sorting network, since potentially any element of $\mathrm{Sym}(T)$ might be a uniquely
correct one. If a sorting network has depth D, then it uses a total of $O(D \cdot T)$ comparators,
and so depth must be $D = \Omega(\log T)$ for any valid sorting network.
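The two bounds just stated can be spelled out as a short calculation. This is a sketch that assumes only that each layer of a network on T wires holds at most $T/2$ comparators (each comparator occupies two wires), together with the elementary estimate $T! \geq (T/2)^{T/2}$.

```latex
\begin{align*}
s(T) &\ge \log_2(T!) \ge \log_2\!\left((T/2)^{T/2}\right)
      = \tfrac{T}{2}\log_2\tfrac{T}{2} = \Omega(T\log T), \\
s(T) &\le D \cdot \tfrac{T}{2}
      \quad\Longrightarrow\quad
      D \ge \frac{2\,s(T)}{T} \ge \log_2\tfrac{T}{2} = \Omega(\log T).
\end{align*}
```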

Because the sorting subroutine is reversible, it makes sense to run it backwards. When 
that happens, the comparators are encountered in reverse order, and each comparator 
swaps the order of its inputs according to whether its sorting-ancilla bit is set. That 
sorting-ancilla bit is then reversibly cleared (regardless of its value) by 'uncomputing' 
the comparison between the two elements. 
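The following sketch illustrates sorting with recorded ancilla bits and then running the network backwards. To stay self-contained it uses an odd-even transposition network of depth T as a simple stand-in, not the AKS or bitonic networks discussed above. All names are illustrative, and comparisons read only a caller-supplied key so that other fields may change between the sort and the unsort.

```python
def brick_network(T):
    """Odd-even transposition network on T wires (a simple depth-T stand-in)."""
    return [[(i, i + 1) for i in range(layer % 2, T - 1, 2)] for layer in range(T)]


def sort_reversibly(network, values, key=lambda v: v):
    """Sort and return the values plus one sorting-ancilla bit per comparator."""
    values, ancillas = list(values), []
    for layer in network:
        for lo, hi in layer:
            bit = int(key(values[lo]) > key(values[hi]))
            ancillas.append(bit)
            if bit:
                values[lo], values[hi] = values[hi], values[lo]
    return values, ancillas


def unsort(network, values, ancillas, key=lambda v: v):
    """Run the network backwards; each ancilla is cleared by recomparing."""
    values, ancillas = list(values), list(ancillas)
    cleared = []
    for layer in reversed(network):
        for lo, hi in reversed(layer):
            bit = ancillas.pop()
            if bit:
                values[lo], values[hi] = values[hi], values[lo]
            cleared.append(bit ^ int(key(values[lo]) > key(values[hi])))
    return values, cleared
```

As long as the keys are left untouched between the two calls, every entry of `cleared` is 0, which mirrors the requirement that the sorting-ancillas are returned to $|0\rangle$.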

3.2 The data-moving algorithm 

A circuit for data moving, $V_N$, depends on just two parameters: the number of data
registers, N, and the number of bits in each data register, d. That is to say, the same 
circuit can be used for moving different 'kinds' of data, since the circuit treats data 
items simply as contiguous arrays of d bits per item, regardless of what this data might 
signify. 

The version of a circuit for $V_N$ that we now describe is slightly more complicated than
is strictly necessary. But this description has the advantage that it establishes the
framework for the parallel look-up circuit, $U_{(N,N)}$ (see §3.3).

It is convenient to break $V_N$ into three basic parts: formatting, sorting, and applying the
permutation. The formatting and sorting both need to be reversed after the permutation
has been applied. Interestingly, the final formatting is not the exact reverse of the initial
formatting, so the algorithm overall isn't 'trivially' a reversible version of a standard
classical algorithm. This can all be annotated as follows, reading right to left:

$$ V_N = \tilde{F} \circ S_{2N}^{-1} \circ P \circ S_{2N} \circ F. \tag{6} $$

We must be careful to count the depth and additional ancilla space required by each of 
these subroutines (see §3.4). 
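As a rough end-to-end illustration of Eqn. (6), the sketch below simulates the five stages on classical basis states; the stages themselves are described in detail in the remainder of this subsection. It is not the circuit itself: Python's built-in sort stands in for the sorting network, the recorded `order` plays the role of the sorting-ancillas, and `None` marks an empty data field. The name `move_data` is hypothetical.

```python
def move_data(indices, data):
    """Simulate V_N = F~ . S^-1 . P . S . F on a basis state (1-based indices)."""
    n = len(data)
    if sorted(indices) != list(range(1, n + 1)):
        raise ValueError("indices must be distinct values in 1..N")
    # F: processor i submits a storage packet (i, 1, x_i) and a request (j_i, 0, empty).
    packets = []
    for i in range(1, n + 1):
        packets.append((i, 1, data[i - 1]))
        packets.append((indices[i - 1], 0, None))
    # S_2N: sort on (address, flag) only, remembering where each packet came from.
    order = sorted(range(2 * n), key=lambda k: packets[k][:2])
    ordered = [packets[k] for k in order]
    # P: each adjacent (request, storage) pair exchanges its data fields.
    for k in range(0, 2 * n, 2):
        (addr, _, empty), (_, _, x) = ordered[k], ordered[k + 1]
        ordered[k], ordered[k + 1] = (addr, 0, x), (addr, 1, empty)
    # S_2N^-1: return every packet to its original position.
    restored = [None] * (2 * n)
    for position, k in enumerate(order):
        restored[k] = ordered[position]
    # F~: read the delivered data out of the request packets.
    return [restored[2 * i + 1][2] for i in range(n)]
```

The permutation step never touches the address or flag, which is why the recorded order remains valid for the unsort.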

1) Initial Formatting The subroutine F (described below) can be achieved using 
SWAP gates and Pauli X gates with no additional ancillas and in depth 1. It is not 
really a 'computation' at all, rather a rearrangement of the input data into a format 
amenable for describing the sorting that will follow. 

Let there be 2N ancilla registers that we call packets. Each packet contains an address
($\lceil \log_2 N \rceil$ bits), a flag (1 bit), and a data (d bits) register. The initial format moves the
input data out of the input registers and into the packets, ready for sorting:

$$ F : |j_1, j_2, \ldots, j_N\rangle |x_1, x_2, \ldots, x_N\rangle \bigotimes_{i=1}^{N} |(0,0,0)\rangle_{2i} |(0,0,0)\rangle_{2i+1} \mapsto |0, 0, \ldots, 0\rangle |0, 0, \ldots, 0\rangle \bigotimes_{i=1}^{N} |(i,1,x_i)\rangle_{2i} |(j_i,0,0)\rangle_{2i+1}, \tag{7} $$

where $j_a \neq j_b$ for all $a \neq b$.

One can think of the map F in the following terms. Each processor, $i$, submits the
packet $(i, 1, x_i)$, and if the data it would like to obtain is at index $j_i$, it also submits the
packet $(j_i, 0, 0)$. The flag is used in the sort and unsort steps to record whether the state
was originally a storage state (1) or a request for data (0).

2) Sorting Sorting was described in §3.1. In this context, we want to sort the 2N
packets, using a lexicographical ordering that reads only the address then the flag of
each packet. In accordance with Eqn. (5), the subroutine $S_{2N}$ must employ a register of
$s(2N)$ sorting-ancilla bits, mapping as

$$ S_{2N} : |0\rangle \otimes \bigotimes_{i=1}^{N} |(i,1,x_i)\rangle_{2i} |(j_i,0,0)\rangle_{2i+1} \mapsto |\sigma\rangle \otimes \bigotimes_{i=1}^{N} |(i,0,0)\rangle_{2i} |(i,1,x_i)\rangle_{2i+1}, \tag{8} $$

where $\sigma \in \mathrm{Sym}(2N)$ is the permutation implied by $2i = \sigma(2i+1)$ and $2i + 1 = \sigma(2j_i)$,
for $i$ from 1 to N.

The total depth of the circuit for unitary $S_{2N}$ is equal to the depth of the sorting network
multiplied by the depth of a single comparator (see §3.4 for a careful analysis).

3) Apply the Permutation After the sort, we are left with a sequence of packets of
the form

$$ \ldots\ (i-1, 0, 0)\ (i-1, 1, x_{i-1})\ (i, 0, 0)\ (i, 1, x_i)\ (i+1, 0, 0)\ (i+1, 1, x_{i+1})\ \ldots \tag{9} $$

where the packets are sorted in lexicographical order according to the address and the
flag. This ordering makes the permutation step especially simple. Without the need for
any auxiliary bits, and in depth 1, SWAP gates can achieve the map

$$ P : \bigotimes_{i=1}^{N} |(i,0,0)\rangle_{2i} |(i,1,x_i)\rangle_{2i+1} \mapsto \bigotimes_{i=1}^{N} |(i,0,x_i)\rangle_{2i} |(i,1,0)\rangle_{2i+1}. \tag{10} $$

An important property of P is that it does not change the address or the flag used in
the sort.
4) Unsorting The sorting network can be run in reverse to return the packets to their
original positions. To achieve the unsort, $S_{2N}^{-1}$ acts on the sorting-ancilla register and
the packets, mapping

$$ S_{2N}^{-1} : |\sigma\rangle \otimes \bigotimes_{i=1}^{N} |(i,0,x_i)\rangle_{2i} |(i,1,0)\rangle_{2i+1} \mapsto |0\rangle \otimes \bigotimes_{i=1}^{N} |(i,1,0)\rangle_{2i} |(j_i,0,x_{j_i})\rangle_{2i+1}. \tag{11} $$

Since $S_{2N}^{-1}$ is the same as the sorting network, but with the order of the gates reversed,
the cost for the unsort step is the same as for step 2.

5) Final formatting The final step in the algorithm is to write the data back to
the original registers and clear the ancilla space. As with the initial formatting, F, the
subroutine $\tilde{F}$ has depth 1. Acting on the same registers as its counterpart F, it works
as follows:

$$ \tilde{F} : |0, 0, \ldots, 0\rangle |0, 0, \ldots, 0\rangle \bigotimes_{i=1}^{N} |(i,1,0)\rangle_{2i} |(j_i,0,x_{j_i})\rangle_{2i+1} \mapsto |j_1, j_2, \ldots, j_N\rangle |x_{j_1}, x_{j_2}, \ldots, x_{j_N}\rangle \bigotimes_{i=1}^{N} |(0,0,0)\rangle_{2i} |(0,0,0)\rangle_{2i+1}. \tag{12} $$

3.3 The parallel look-up algorithm

We now present the algorithm for accessing memory in parallel, implementing the unitary
$U_{(N,N)}$ defined in Eqn. (3). As with $V_N$, a circuit for $U_{(N,N)}$ depends only on the
parameters N and d. We will generalize the algorithm given for $V_N$ to show how $U_{(N,N)}$
may be efficiently implemented.

The overall structure for $U_{(N,N)}$ is the same as for $V_N$. We first construct packets that
include the original data and a flag to be used in the sorting step. After a sorting
network is applied, we transform the data. Instead of the permutation used in the
data moving algorithm, the transformation to use is composed of cascading, B, and
copying, C. Finally, we reverse the sort and map the data back to the original registers.
Accordingly, the parallel look-up algorithm can be written

$$ U_{(N,N)} = \tilde{F} \circ S_{2N}^{-1} \circ B^{-1} \circ C \circ B \circ S_{2N} \circ F. \tag{13} $$

1) Initial Formatting For the parallel look-up algorithm, we use packets containing
four items. Each packet contains an address and flag as before, but now we have two data
registers, target-data and memory-data, each of d bits. The initial formatting
stage, F, resembles the same step in §3.2,

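$$ F : |j_1, \ldots, j_N\rangle |y_1, \ldots, y_N\rangle |x_1, \ldots, x_N\rangle \bigotimes_{i=1}^{N} |(0,0,0,0)\rangle_{2i} |(0,0,0,0)\rangle_{2i+1} \mapsto |0, \ldots, 0\rangle |0, \ldots, 0\rangle |0, \ldots, 0\rangle \bigotimes_{i=1}^{N} |(i,1,0,x_i)\rangle_{2i} |(j_i,0,y_i,0)\rangle_{2i+1}, \tag{14} $$

putting all the data into the packets, where it can be processed by the rest of the
algorithm. The initial formatting step is achieved in one timestep.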

2) Sorting The sorting step is the same as before: we sort lexicographically reading
only the address and flag of each packet.

$$ S_{2N} : |0\rangle \otimes \bigotimes_{k=1}^{2N} |(i_k, f_k, y_k, x_k)\rangle_k \mapsto |\sigma\rangle \otimes \bigotimes_{k=1}^{2N} |(i_{\sigma(k)}, f_{\sigma(k)}, y_{\sigma(k)}, x_{\sigma(k)})\rangle_k. \tag{15} $$

Note that packets whose flag is $|1\rangle$ hold data in their memory-data registers, while
packets whose flag is $|0\rangle$ hold 'target data' in their target-data registers. At the end of
the sort, we are left with a sequence of the form

$$ \ldots\ (i-1, 0, y, 0)\ (i-1, 1, 0, x_{i-1})\ (i, 0, y', 0)\ \ldots\ (i, 0, y'', 0)\ (i, 1, 0, x_i)\ (i+1, 0, y''', 0)\ \ldots \tag{16} $$

3) Cascade The goal of cascade is to send a copy of the memory-data registers to
the left into the empty memory-data registers of packets that have their flags set to $|0\rangle$.
Since there is no way of knowing in advance how far the data will need to propagate,
we need a method that works in all cases. For example, every processor could request
data from a single processor, say $j_0 = j_1 = \cdots = j_{N-1} = 1$. This can be achieved by
dividing the cascade up into n smaller phases, where $n = \lceil \log_2(2N) \rceil$. Accordingly we
write

$$ B = B_{n-1} \circ B_{n-2} \circ \cdots \circ B_1 \circ B_0, \tag{17} $$

where each $B_k$ phase acts on pairs of packets at relative index separation $2^k$ from one
another, more or less in parallel.

To achieve this transformation, each packet will need a fresh ancilla aux-phase register
of $\lceil \log_2 n \rceil$ bits. This will be used to store a number indicating the phase in which that
packet acquires a copy of the data being cascaded to it. This record makes it much easier
to implement the whole cascade reversibly, and these aux-phases are to be retained until
the cascade is later reversed. Each packet will also need an aux-action bit that persists
only throughout a single phase (recycled from one phase to the next), being set if that
packet receives data during that phase.

Each phase involves many pairs of packets. Phase $B_k$ involves pairs with indices $l$ and
$l + 2^k$, for all $l$ for which $l \geq 0$ and $l + 2^k < 2N$. If $\lfloor l / 2^k \rfloor$ is even (resp. odd), then
the pair $(l, l + 2^k)$ is said to be of even (resp. odd) parity. The idea is that data can
legitimately cascade from the packet at $l + 2^k$ to the one at $l$ if they have the same
address and if the rightmost one presently has data but the leftmost one doesn't. It is
enough to check that the leftmost one has $|0\rangle$ for its flag and $|0\rangle$ for its aux-phase, and
that the rightmost one has $|1\rangle$ for its flag or something non-zero for its aux-phase. If
that overall condition is met, flip the aux-action bit of the leftmost packet, because it
will be receiving data this phase. It is necessary to set the aux-actions for all the even
parity pairs first, then for all the odd parity pairs, because otherwise the same bits would
be read by different gates at the same time, and that violates the rules of the
standard circuit model. Note also that the simple act of computing the aux-action bit
may itself require a little extra ancilla scratch space.

When all the aux-action bits have been correctly set for the present phase, cascade data
leftwards, locally conditioned on those aux-action bits. For example, during phase $B_k$,
when examining packets $l$ and $l + 2^k$, if $l$ has had its aux-action set, then we need to
implement

$$ |(i,0,y',0)\rangle_l\, |0\rangle_{\mathrm{phase}(l)}\, |(i,f,y,x)\rangle_{l+2^k}\, |p\rangle_{\mathrm{phase}(l+2^k)} \mapsto |(i,0,y',x)\rangle_l\, |k+1\rangle_{\mathrm{phase}(l)}\, |(i,f,y,x)\rangle_{l+2^k}\, |p\rangle_{\mathrm{phase}(l+2^k)}. \tag{18} $$

(The aux-action for $l$ got set because one of $f$ and $p$ was non-zero.) This should be done
for the even parity pairs first (say), then the odd parity ones.

Finally, the aux-action bits need to be reset. This resetting is a 'local' operation, because
during phase $B_k$, each packet need flip its aux-action if and only if its aux-phase is
indicating that it was active this phase. Note that the condition for resetting an aux-action
bit is completely different from the condition for setting it in the first place.

The total effect of B will be to load up the aux-phase ancillas and to replace every
instance of $|(j_i, 0, y_i, 0)\rangle$ with $|(j_i, 0, y_i, x_{j_i})\rangle$, while preserving every instance of
$|(i, 1, 0, x_i)\rangle$.

4) Copying C is a simple depth 1 local operation. Every packet simply xors the
contents of its memory-data into its target-data. This has the effect of mapping every
$|(j_i, 0, y_i, x_{j_i})\rangle$ to $|(j_i, 0, y_i \oplus x_{j_i}, x_{j_i})\rangle$, while every $|(i, 1, 0, x_i)\rangle$ gets mapped to $|(i, 1, x_i, x_i)\rangle$.

5) Reversing the Cascade This map reverses the effect of the cascade, cleaning up
all the aux-phase ancillas, making the packets ready for unsorting.
The total effect of $B^{-1} \circ C \circ B$ is to replace every instance of $|(j_i, 0, y_i, 0)\rangle$ with
$|(j_i, 0, y_i \oplus x_{j_i}, 0)\rangle$, while replacing $|(i, 1, 0, x_i)\rangle$ with $|(i, 1, x_i, x_i)\rangle$ for all $i$. This action
does not change the ordering of the packets, which depends only on the address and
flag of each packet. Therefore the action is compatible with the sorting stages, as
required.

6) Unsort The unitary $S_{2N}^{-1}$ unsorts just as before.

7) Final Formatting The final step is to apply a formatting map, $\tilde{F}$, which works as
follows,

$$ \tilde{F} : |0, \ldots, 0\rangle |0, \ldots, 0\rangle |0, \ldots, 0\rangle \bigotimes_{i=1}^{N} |(i,1,x_i,x_i)\rangle_{2i} |(j_i,0,y_i \oplus x_{j_i},0)\rangle_{2i+1} \mapsto |j_1, \ldots, j_N\rangle |y_1 \oplus x_{j_1}, \ldots, y_N \oplus x_{j_N}\rangle |x_1, \ldots, x_N\rangle \bigotimes_{i=1}^{N} |(0,0,0,0)\rangle_{2i} |(0,0,0,0)\rangle_{2i+1}. \tag{19} $$

Note that this is a depth 2 map rather than a depth 1 map because each $x_i$ appears in
two places, and these need to 'relocalise' as well as 'move'.

3.4 Proof of Theorems 2 and 3 

Here we count up the total depth and width of circuits for $V_N$ and for $U_{(N,N)}$, thereby
proving Theorems 2 and 3.

The total depth of the formatting subroutines is $O(1)$.

The sorting subroutines require comparators that make a lexicographic comparison on 
[log2 A'"] -|- 1 bits, and lexicographic comparison is basically the first part of arithmetic 
subtraction (regard the bit-patterns as integers, subtract one from the other reversibly, 
read the sign bit of the output). This can be achieved efficiently in depth 0(loglog A^) 
[25]. The other thing a comparator does is to swap elements controlled on a sorting-bit. 
The sizes of our elements are 0(log A^-|- d), and so the sorting bit needs to be fanned out 
(using a binary tree) to 'copy' it across 0{log N + d) ancillas, before a depth 1 swap can 
take place. Therefore the swapping stage of a comparator takes depth 0(log(log A^ + d)), 
and so the total comparator has depth 0(loglog A'^ -|- log(log A'^ + d)) = 0{log{d\og N)). 
Since there are O(logA^) comparators in the AKS sorting network, the total depth of 
the sorting stage is 0(log A^ • log (d log A^)). 
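The two ingredients of a comparator described above can be illustrated classically. This is a sketch, not the reversible circuit of [25]: the helper names are hypothetical, packets are modelled as tuples whose first two entries are the address and flag, and `fanout_depth` only counts the rounds of a binary-tree fan-out.

```python
def less_than_via_sign_bit(a, b, bits):
    """Return 1 if a < b, read off as the sign bit of a - b in (bits + 1)-bit arithmetic."""
    diff = (a - b) % (1 << (bits + 1))   # two's-complement subtraction
    return diff >> bits                  # top bit is the sign bit


def sort_key(address, flag):
    """Pack (address, flag) into one integer so integer order is lexicographic order."""
    return (address << 1) | flag


def fanout_depth(copies):
    """Rounds a binary tree needs to produce `copies` copies of a single control bit."""
    return (copies - 1).bit_length()


def comparator(p, q, address_bits):
    """Order two packets by (address, flag), swapping when q should come first."""
    swap = less_than_via_sign_bit(sort_key(q[0], q[1]), sort_key(p[0], p[1]),
                                  address_bits + 1)
    return (q, p) if swap else (p, q)
```

For instance, `fanout_depth(8)` is 3, reflecting the logarithmic cost of copying the sorting bit across a packet before the depth 1 swap.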

The cascade subroutine has $n = O(\log N)$ phases, and as with the comparators used in
sorting, each phase of cascade involves arithmetic (in fact equality testing) on objects
of size $O(n)$, plus controlled copying of objects of size $O(d)$. Therefore a phase has
depth $O(\log n + \log d) = O(\log(d \log N))$ and the total depth of the cascade part is
$O(\log N \cdot \log(d \log N))$.

The inner subroutine ('copying') of the algorithm for $U_{(N,N)}$ has depth $O(1)$ and requires
no ancillas.

Therefore the total depth of our algorithm for $U_{(N,N)}$ is $O(\log N \cdot \log(d \log N))$, and for
$V_N$ it is of the same order.

Counting bits, we see that besides the $O(N(\log N + d))$ input/output bits, we require
$O(N(\log N + d))$ more bits for the packets, $O(N \log N)$ sorting-ancilla bits, as well
as $O(N(\log N + d))$ ancilla bits for temporary use while rendering the (lexicographic)
comparators. Furthermore, for cascade each packet needs an aux-phase register and an
aux-action bit (total $O(N \log N)$ bits), as well as $O(\log N + d)$ scratch space (e.g. to
compute/reset the aux-action bit) that can be recycled between phases. Hence, the total
circuit width is $O(N(\log N + d))$, which is linear in the width of the logical unitary in
question, and therefore asymptotically optimal. □



4 Distributed Quantum Computing 

The circuit model is universal for quantum computing, with the gate set consisting of 
all single-qubit unitaries and the 2-qubit CNOT gate, for example. 

The circuit model allows any pair of qubits to be connected by gates and so allows 
arbitrarily long-range interactions. This model is very far from any likely implementation 
of a quantum computer. We imagine that a small number of long-range interactions could 
be possible, but most gates will need to be local. 

It is well known that if the qubits are laid out in a line and each could only connect 
with its nearest neighbours (one either side), then the resulting model of computation 
would still be universal, because it could emulate any circuit of the more general kind. 
The emulation proceeds in a straightforward fashion using SWAP gates to bring qubits 
next to each other so that nearest-neighbor gates can be applied. The price to pay in 
this emulation is (in general) an overhead factor of $O(W)$ in the depth of the algorithm,
where W counts the total width (number of qubits) of the circuit being emulated. 
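To make the $O(W)$ overhead concrete, the following sketch routes one logical qubit along a line with SWAPs until it sits next to its partner. It is illustrative only: `emulate_gate_on_line` and the `layout` list (position to logical qubit) are hypothetical names, and no attempt is made to schedule swaps for several gates at once.

```python
def emulate_gate_on_line(layout, a, b):
    """Move logical qubit a next to logical qubit b using nearest-neighbour SWAPs.

    layout[pos] is the logical qubit stored at position pos; it is updated in place.
    Returns the list of SWAPs applied, each as a pair of adjacent positions.
    """
    position = {qubit: pos for pos, qubit in enumerate(layout)}
    pa, pb = position[a], position[b]
    step = 1 if pb > pa else -1
    swaps = []
    while abs(pb - pa) > 1:
        nxt = pa + step
        layout[pa], layout[nxt] = layout[nxt], layout[pa]
        swaps.append((min(pa, nxt), max(pa, nxt)))
        pa = nxt
    return swaps
```

A gate between the two ends of a line of $W$ qubits needs $W - 2$ swaps in this scheme, which is where the linear overhead comes from.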

More generally, we could envisage having memory larger than a single qubit at each 'site', 
with connectivity more generous than simply being connected each to two neighbours 
in a line. Then the overhead depth factor (or emulation factor) of universal emulation 
could generally be reduced to something smaller than $O(W)$. Our design of §3.2 is useful
in this context, as explained next.

4.1 Efficient Distributed Quantum Computing



Considering a general quantum circuit of width $W$, let there be $N$ quantum processors,
each with its own local memory of $Q$ qubits. Suppose that $Q \cdot N \geq W$ and suppose the
processors are interconnected as the $N$ nodes of a graph $\mathcal{G}$. We say that a circuit respects
graph locality if every two-qubit gate in the circuit either has those two qubits lying in
the same processor's local memory or else has them lying in neighbouring processors'
memories (neighbouring with respect to $\mathcal{G}$). In addition, we require that each gate is
assigned to a processor that holds at least one of its qubits, and each processor can
only perform one gate per timeslice. Together, these restrictions on the ordinary circuit
model define the distributed quantum computing model formally.
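The three conditions of this definition can be checked mechanically. The sketch below assumes a hypothetical encoding: a circuit is a list of timeslices, each gate is a pair `(processor, qubits)`, `home` maps each qubit to the processor storing it, and `edges` is a set of `frozenset` processor pairs.

```python
def respects_graph_locality(timeslices, home, edges):
    """Check a circuit against the distributed quantum computing model.

    timeslices: list of timeslices, each a list of gates (processor, qubits).
    home: dict mapping each qubit to the processor whose memory holds it.
    edges: set of frozenset({p, q}) pairs giving the processor graph.
    """
    for gates in timeslices:
        busy = set()
        for processor, qubits in gates:
            owners = {home[q] for q in qubits}
            if processor not in owners:
                return False      # gate must run where one of its qubits lives
            if processor in busy:
                return False      # one gate per processor per timeslice
            busy.add(processor)
            if len(owners) == 2 and frozenset(owners) not in edges:
                return False      # remote qubits must sit on neighbouring processors
    return True
```

For example, on a ring of four processors with two qubits each, a gate between qubits held by processors 0 and 2 is rejected, while one between processors 1 and 2 is accepted.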

What then is the best overhead depth factor for arbitrary circuit emulation in the worst
case? Starting with an arbitrary circuit of $W$ qubits (one perhaps not respecting the
locality of $\mathcal{G}$), we wish to emulate it using a circuit that respects graph locality. We want
the overhead depth factor of this emulation to be as small as possible. That is, we wish
to minimise the function

$$F(\mathcal{G}, Q, W) = \max_{C} \frac{\mathrm{depth}(C')}{\mathrm{depth}(C)}, \tag{20}$$

where $C'$ is a circuit for emulating $C$ subject to the constraints imposed by $\mathcal{G}$ and $Q$.
Maximisation is over all circuits $C$ of width $W$, so $F(\mathcal{G}, Q, W)$ is the worst case cost of
emulating arbitrary circuits. Normally we are concerned with the case $W = |\mathcal{G}| = N$, so
that an emulation always has one processor per qubit being emulated, and where $Q$ is
large enough to hold ancillas for basic computations.

The emulation factor, $F(\mathcal{G}, Q, W)$, must be at least the diameter of $\mathcal{G}$, and that in turn
depends on the order, $N$, and valency, $\mathrm{val}(\mathcal{G})$, of the graph in question. The following
Theorem demonstrates a nice trade-off reducing the overhead depth factor significantly,
while keeping both $Q$ and $\mathrm{val}(\mathcal{G})$ 'reasonably' small.

Theorem 4. For a distributed quantum computer with $|\mathcal{G}| := N$, we can find a graph
$\mathcal{G}$ for which $\mathrm{val}(\mathcal{G}) = O(\log N)$, and take $Q = O(\log N)$, and yet have overhead depth
factor $F(\mathcal{G}, Q, N) = O(\log^2 N)$ for emulating arbitrary quantum circuits of width $N$.

Proof: The proof of the Theorem uses our algorithm for $V_N$ (cf. §§2.3, 3.2). The proof
is clearest when we take $d = 1$, but other settings are possible. In overview, let each
processor be big enough to hold two packets and some ancillas, i.e. $Q = O(d + \log N)$.
First we show how $V_N$ with data size $d = 1$ can be efficiently implemented in this
processor model (i.e. as a circuit respecting graph locality), where the data bits are
distributed one per processor. This is more or less a direct embedding of our algorithm
for $V_N$ into the processor model, and serves to define the graph $\mathcal{G}$ of the Theorem. Then
we show how any general circuit of width $N$ can be efficiently fit into the processor
model, emulated in terms of intra-processor gates and renditions of $V_N$ only.

Implementing $V_N$ To implement $V_N$ across processors, start by putting one data
bit per processor. We need to implement subroutines Format, Sort, and Permutation
(cf. Eqn. (6)). The first and last of these are entirely local operations, so their circuits
are already admissible to the processor model.

It only remains to check that the circuit for the sorting subroutine is admissible within
the processor model, and that of course depends on the graph $\mathcal{G}$. The gates of this
sorting subroutine all belong to comparators of the sorting network. Each comparator
can be admitted within the processor model provided that the topology of $\mathcal{G}$ is inherited
from the topology of the sorting network. The sorting network itself has depth $O(\log N)$
in the optimal case, and so there exists a graph $\mathcal{G}$ with valency $O(\log N)$ with regard to
which our algorithm for $V_N$ embeds unaltered.

Since there can only be one gate per processor per timeslice in this model, the depth of
$V_N$ goes up from $O(\log N \cdot \log(d \log N))$ to $O(\log N \cdot (d + \log N))$. This latter figure may
be written $O(\log^2 N)$ when $d = 1$.

Emulating general circuits via $V_N$ Given a general circuit of width $N$, in each
timeslice there will be at most $N$ gates. These gates should be 'assigned' one per
processor when the timeslice is emulated. When a processor comes to emulate a gate
assigned to it, it will need access to the one or two qubits of that gate. The emulation of
a timeslice therefore requires two calls to the subroutine $V_N$: without loss of generality,
we can assume that the first qubit of a gate already resides at the processor to which
that gate has been assigned; the first call to $V_N$ brings the second qubit of each gate to
its processor; the processor implements the gate locally on the two qubits; the second
call to $V_N$ restores the second qubit of each gate to its original home. Null ancilla qubits
can be included within each $V_N$ operation in order to make it a permutation.

Every timeslice of the circuit being emulated now additionally requires two calls to $V_N$,
plus appropriate $O(\log N)$-sized circuitry (per processor) to write and erase the indices
$j_i$ used within $V_N$. Overall this costs an overhead depth factor of $O(\log^2 N)$. This
completes the proof of Theorem 4. □
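The two-call pattern can be illustrated on classical bits with reversible gates. This is a sketch under several assumptions: each processor has one home slot and one guest slot, `None` models a null ancilla, `data_move` is a classical stand-in for $V_N$ in which slot $i$ receives the item at `perm[i]`, and the gates within a timeslice act on disjoint qubits. All function names are hypothetical.

```python
def data_move(perm, data):
    """Classical stand-in for V_N: slot i receives data[perm[i]] (perm is a permutation)."""
    assert sorted(perm) == list(range(len(data)))
    return [data[p] for p in perm]


def inverse(perm):
    """Inverse permutation, used for the second (restoring) call."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv


def emulate_timeslice(bits, gates):
    """Emulate one timeslice; gates is a list of (a, b, fn) acting on disjoint qubits.

    fn maps (bit_a, bit_b) to the new (bit_a, bit_b) and is run at processor a.
    """
    n = len(bits)
    slots = list(bits) + [None] * n          # home slots, then guest slots (null ancillas)
    perm = list(range(2 * n))
    for a, b, _ in gates:
        perm[n + a], perm[b] = b, n + a      # guest slot of a and home slot of b trade places
    slots = data_move(perm, slots)           # first call: bring second qubits in
    for a, _, fn in gates:
        slots[a], slots[n + a] = fn(slots[a], slots[n + a])
    slots = data_move(inverse(perm), slots)  # second call: send them home
    return slots[:n]
```

Padding with null ancillas is what turns the partial assignment "second qubit of each gate goes to its processor" into a full permutation.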

4.2 Implementing quantum circuits on any architecture 

In §3.1, we pointed out that for any (positive integer) T there must be a lowest-depth 
sorting network for sorting T elements deterministically. More generally, one can say 
that for any connected graph $\mathcal{G}$ of $T$ vertices, there must exist a lowest-depth sorting
network for sorting $T$ elements, all of whose comparators lie along edges of $\mathcal{G}$. For
example, when $T = 4$ and $\mathcal{G}$ is a 4-cycle, then the bitonic sorting network fits the graph,
and, having six comparators in depth 3, is optimal. But for $T = 4$ and $\mathcal{G}$ a graph of three
edges in a line, the bitonic sort is not possible, and a sort involving six comparators in
depth 4 turns out to be optimal for this graph.

In Theorem 4, we constrained the valency ($\mathrm{val}(\mathcal{G})$) and order ($N$) of $\mathcal{G}$, but otherwise
permitted any connected graph, and for the proof of the Theorem we used the graph
implied by the best known sorting network for that $N$. This concept can be generalised to
any given connected graph $\mathcal{G}$ on $N$ vertices. The problem of identifying the best sorting
algorithm for $N$ items that is compatible with $\mathcal{G}$ is essentially the same as identifying the
most efficient circuit for implementing $V_N$ in the distributed quantum computing model.
Hence, the overhead depth factor associated with the distributed quantum computing
model employing graph $\mathcal{G}$ is related to the cost of the optimal sorting algorithm over
$\mathcal{G}$. Scheduling tasks on topologically constrained (quantum) computing platforms may
be regarded as (more or less) equivalent to the problem of designing sorting algorithms
(including sorting networks) commensurate with those topologies. We put these ideas
on a firmer footing in the following Theorem.
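As a concrete check of the two $T = 4$ examples given at the start of this subsection, the sketch below verifies by the zero-one principle that each network sorts, and tests which graph each one fits. The depth 4 network used for the line is odd-even transposition sort, chosen here as an assumption since the text does not name the network; optimality is not checked.

```python
from itertools import product


def apply_network(layers, values):
    """Run a comparator network; comparator (lo, hi) leaves the smaller value at lo."""
    v = list(values)
    for layer in layers:
        for lo, hi in layer:
            if v[lo] > v[hi]:
                v[lo], v[hi] = v[hi], v[lo]
    return v


def sorts_everything(layers, n):
    """Zero-one principle: a network sorts all inputs iff it sorts all 0/1 inputs."""
    return all(apply_network(layers, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


def fits_graph(layers, edges):
    """True if every comparator lies along an edge of the graph."""
    return all(frozenset(c) in edges for layer in layers for c in layer)


BITONIC_4 = [[(0, 1), (2, 3)], [(0, 3), (1, 2)], [(0, 1), (2, 3)]]
ODD_EVEN_4 = [[(0, 1), (2, 3)], [(1, 2)], [(0, 1), (2, 3)], [(1, 2)]]
CYCLE_4 = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
PATH_4 = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
```

Both networks use six comparators; the bitonic one needs the edge between vertices 0 and 3, which the 4-cycle has and the line does not.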

Theorem 5. Let $\mathcal{G}$ be any connected graph on $N$ nodes, and consider the distributed
quantum computing model with graph $\mathcal{G}$ and $Q = O(\log N)$ qubits per processor. Let
$D_{\mathcal{G}}$ denote the depth of the best algorithm for sorting $2N$ arbitrary bit strings of length
$\lceil \log_2 N \rceil + 2$ over $\mathcal{G}$ (with $Q$ qubits per node). Then the minimum depth overhead,
$F_{\min}(\mathcal{G}, Q, W)$, for emulating arbitrary circuits is bounded by

$$F_{\min}(\mathcal{G}, Q, N) \leq O(D_{\mathcal{G}}) \tag{21}$$

and

$$\Omega\!\left(\frac{D_{\mathcal{G}}}{\log N \log\log N}\right) \leq F_{\min}(\mathcal{G}, Q, O(N \log N)). \tag{22}$$

Proof: To prove the upper bound (21) we give an explicit construction analogous to
the one used in Theorem 4. It emulates a circuit using an efficient sorting algorithm
whose depth is given by $D_{\mathcal{G}}$. For any circuit $C$ of width $W = N$ qubits, we construct
its emulator circuit, $C'$, by assigning each qubit and gate of $C$ to a processor of $C'$ and
then using an implementation of $V_N$ (with $d = 1$) to move qubits around so that gates are
always local. Recall from §3.2 that our description of $V_N$ required sorting $2N$ packets,
each of size $\lceil \log_2 N \rceil + 1 + d$ bits.

The rest of the proof for the upper bound then mirrors the proof for Theorem 4. As
before, the formatting and copying steps take time $O(\log N)$; the only difference here is
that the sort is achieved in depth $D_{\mathcal{G}}$. Hence the cost of this emulation algorithm is
$F(\mathcal{G}, Q, N) = O(\log N + D_{\mathcal{G}}) = O(D_{\mathcal{G}})$ since $D_{\mathcal{G}} = \Omega(\log N)$.

For the lower bound (22), consider the AKS sorting network for sorting $2N$ packets
of bit-length $\lceil \log_2 N \rceil + 2$. Let $C$ be the (unconstrained) circuit for achieving this, so
the width of $C$ is $W = O(N \log N)$ and its depth is $O(\log N \log\log N)$ (cf. §3.4). The
depth of any emulation of $C$ is at least $D_{\mathcal{G}}$, since the emulation is a sorting algorithm on
the $(\mathcal{G}, Q)$-distributed computer and we defined $D_{\mathcal{G}}$ to be the depth of the best sorting
algorithm. Hence the cost of any emulation is lower bounded by $\Omega(D_{\mathcal{G}} / (\log N \log\log N))$,
as required. □

5 Revisiting Popular Algorithms



Grover's quantum search algorithm [16] and the generalization to amplitude amplification
[7, 8] have the great advantage of being widely applicable. Problems such as element
distinctness [10, 4], collision finding [9], triangle finding [5, 20], 3-SAT [3] and NP-hard
tree search problems [11] all have faster quantum algorithms because they are able to
make use of amplitude amplification. In this Section, we revisit Grover's quantum search
algorithm and the Element Distinctness problem, resolving a challenge posed by Grover
and Rudolph [17].

5.1 Quantum search of an unstructured database 

Grover's fast quantum search algorithm [16] is usually presented as solving an oracle
problem. We are presented with a function $f : \{1, \dots, N\} \to \{0, 1\}$ and wish to find
solutions $s_j$ such that $f(s_j) = 1$. We also have an oracle with the ability to recognise
solutions $s_j$ in the form of a unitary

$$U_f : |y\rangle|b\rangle \mapsto |y\rangle|b \oplus f(y)\rangle. \tag{23}$$

Setting the target register, $|b\rangle$, of $U_f$ to the state $(|0\rangle - |1\rangle)/\sqrt{2}$ encodes the value of
$f(y)$ into a phase shift

$$U_f : |y\rangle \mapsto (-1)^{f(y)}|y\rangle, \tag{24}$$

where we have suppressed the target register since it remains unchanged. Grover's
algorithm then makes $O(\sqrt{N/M})$ calls to $U_f$ to find one of the $M$ possible solutions, $s_j$,
uniformly at random. To find more solutions we simply repeat the algorithm, and by a
coupon-collector argument [13], we find $r$ solutions in overall circuit depth $O(r \log r \sqrt{N/M})$.
Throughout this Section, we set the width and depth of a function evaluation to be $O(1)$,
since the cost is simply an overall multiplicative factor.
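The $O(\sqrt{N/M})$ call count can be seen in a small state-vector simulation. This is a sketch with assumptions: items are indexed from 0, the oracle is applied directly as the phase flip of Eqn. (24), and the number of iterations is taken to be $\lfloor (\pi/4)\sqrt{N/M} \rfloor$, a standard choice not stated in the text. The name `grover_search` is hypothetical.

```python
import math


def grover_search(n_items, marked, iterations=None):
    """Simulate Grover's algorithm; returns (success probability, iterations used)."""
    m = len(marked)
    if iterations is None:
        iterations = math.floor(math.pi / 4 * math.sqrt(n_items / m))
    amp = [1 / math.sqrt(n_items)] * n_items      # uniform superposition
    for _ in range(iterations):
        for s in marked:                          # phase oracle: |y> -> (-1)^f(y) |y>
            amp[s] = -amp[s]
        mean = sum(amp) / n_items
        amp = [2 * mean - a for a in amp]         # inversion about the mean
    return sum(amp[s] ** 2 for s in marked), iterations
```

With 64 items and one solution, six oracle calls already concentrate almost all of the probability on the solution, compared with an expected 32 classical queries.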

Grover's algorithm can be applied to search an unstructured database. We construct
the oracle by using the single memory look-up unitary, $U_{(1,N)}$, together with a simple
function that compares an input to the database entry (see Chap. 6.5 in [21]). More
generally, suppose we wish to find solutions to a function whose inputs cannot be simply
expressed as the numbers from 1 to $N$, but rather are taken from a database of elements
$X = \{x_j : j = 1, \dots, N\}$, where each $x_j$ is a bit string of length $d$. That is, we have a
function $\alpha : X \to \{0, 1\}$, and are searching for solutions $x_j \in X$ such that $\alpha(x_j) = 1$.
Once the database item $x_j$ has been loaded into the computational memory, the function
$\alpha$ is computed using the unitary

$$U_\alpha : |j\rangle|x_j\rangle|b\rangle \mapsto |j\rangle|x_j\rangle|b \oplus \alpha(x_j)\rangle. \tag{25}$$

We consider the case where the database is held in a quantum state $|x_1, \dots, x_N\rangle$, but it
could also be a classical database whose indices can be accessed in superposition [14, 15].
An oracle is constructed by first looking up a database entry and then testing if this entry
is a solution to the function $\alpha$. The initial state of the computer is

$$|j\rangle|0\rangle|b\rangle|x_1, \dots, x_N\rangle, \tag{26}$$

where the state $|j\rangle$ is the index of the database item we will look up and $|0\rangle$ and $|b\rangle$ are
auxiliary states used to load the memory and store the result respectively.

First we apply the single memory look-up unitary, $U_{(1,N)}$ (see Definition 1), so that the
computer is in the state

$$|j\rangle|x_j\rangle|b\rangle|x_1, \dots, x_N\rangle; \tag{27}$$

then calling the function $\alpha$, using the corresponding unitary $U_\alpha$, maps the state to

$$|j\rangle|x_j\rangle|b \oplus \alpha(x_j)\rangle|x_1, \dots, x_N\rangle. \tag{28}$$

Finally we restore the auxiliary state used to load the database item by applying
$U_{(1,N)}^{-1} = U_{(1,N)}$. The final state of the computation is therefore

$$|j\rangle|0\rangle|b \oplus \alpha(x_j)\rangle|x_1, \dots, x_N\rangle. \tag{29}$$

Hence the unitary

$$O_\alpha = U_{(1,N)} \circ U_\alpha \circ U_{(1,N)} \tag{30}$$

can be used as an oracle for Grover's algorithm. Setting the target register to
$(|0\rangle - |1\rangle)/\sqrt{2}$ encodes the value of $\alpha(x_j)$ into a phase shift

$$O_\alpha : |j\rangle \mapsto (-1)^{\alpha(x_j)}|j\rangle. \tag{31}$$

The quantum circuit implementing Grover's algorithm with $O_\alpha$ as an oracle requires
circuit width $O(N \log N)$ and depth $O(\sqrt{N/M} \log N)$ to find one solution.
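The load, test, unload structure of the oracle can be traced on computational basis states. The sketch below is classical and illustrative: `lookup` models the look-up unitary as an XOR of the addressed entry into the load register (so it is its own inverse), indices are 0-based, and `alpha` is any Python predicate returning 0 or 1.

```python
def lookup(j, reg, memory):
    """Look-up on basis states: XOR database entry j into the load register."""
    return j, reg ^ memory[j]


def oracle(j, b, memory, alpha):
    """Apply look-up, then the test alpha, then look-up again, as in the oracle above."""
    reg = 0
    j, reg = lookup(j, reg, memory)    # load x_j
    b ^= alpha(reg)                    # b -> b XOR alpha(x_j)
    j, reg = lookup(j, reg, memory)    # unload: the register returns to 0
    assert reg == 0
    return j, b
```

The second look-up matters in superposition: leaving $x_j$ in the load register would entangle it with the index and spoil the interference that Grover's algorithm relies on.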

Now that we have an efficient algorithm for performing parallel memory look-ups, we
consider the effect of using the unitary $U_{(N,N)}$ together with (up to) $N$ functions as an
oracle for Grover's algorithm.

Suppose we have (up to) $N$ functions that can take database elements as inputs,
$\alpha_i : X \to \{0, 1\}$, for $i = 1, \dots, N$. Just as with a standard Grover search, each $\alpha_i$ can be
any function provided it is computable in time polynomial in $\log N$ (and $d$). We wish
to find solutions $x_{j_i} \in X$ such that $\alpha_i(x_{j_i}) = 1$, for all $i = 1, \dots, N$. The following
theorem provides an algorithm that finds (up to) $N$ solutions, one for each function,
using the same size quantum circuit (up to log factors) as would be required to find only
one solution using the unitary $U_{(1,N)}$ as an oracle.

Theorem 6 (Multi-Grover search algorithm). Using the notation defined above, there
is a quantum algorithm that for each $i = 1, \dots, N$ either returns $j_i$ such that $\alpha_i(x_{j_i}) = 1$,
or, if there is no such solution, returns 'no solution'. The algorithm can be implemented
using a quantum circuit with width $\tilde{O}(N)$ and depth $\tilde{O}(\sqrt{N})$.

Proof: We proceed as with the single Grover search of an unstructured database. The
only subtlety is that we need to organise the Hilbert space in the correct way. We
want to perform $N$ Grover searches over the database, so we need $N$ indices $|j_1 \dots j_N\rangle$,
$N$ memory place holders $|0 \dots 0\rangle$ and $N$ target registers $|b_1 \dots b_N\rangle$.

It is useful to split the circuit into $N$ 'processors' since we will think of each one as
performing a Grover search over the database. We rearrange the Hilbert space across $N$
processors as

$$|j_1, \dots, j_N\rangle|0, \dots, 0\rangle|b_1, \dots, b_N\rangle|x_1, \dots, x_N\rangle = \bigotimes_{i=1}^{N} |j_i, 0, b_i, x_i\rangle. \tag{32}$$

Applying the parallel look-up algorithm maps the initial state of the computer to

$$\bigotimes_{i=1}^{N} |j_i, x_{j_i}, b_i, x_i\rangle. \tag{33}$$

We define a circuit for implementing the unitaries $U_{\alpha_i}$ in parallel as

$$U_\alpha = \bigotimes_{i=1}^{N} U_{\alpha_i}, \tag{34}$$

which acts on all of the target registers simultaneously, sending the state to

$$\bigotimes_{i=1}^{N} |j_i, x_{j_i}, b_i \oplus \alpha_i(x_{j_i}), x_i\rangle. \tag{35}$$

Finally, we clear the auxiliary register used to load the memory item using the unitary
$U_{(N,N)}$, so that the final state of the computer is

$$\bigotimes_{i=1}^{N} |j_i, 0, b_i \oplus \alpha_i(x_{j_i}), x_i\rangle. \tag{36}$$

Setting the target registers to $(|0\rangle - |1\rangle)/\sqrt{2}$ produces the required oracle using a
circuit with width and depth $O(N \log N)$ and $O(\log N \log\log N)$, respectively (cf.
Theorem 2).

Grover's algorithm then calls the unitary

$$O_\alpha = U_{(N,N)} \circ U_\alpha \circ U_{(N,N)} \tag{37}$$

$\sqrt{N}$ times. The resulting algorithm finds one solution for each function $\alpha_i$ using a circuit
with width $O(N \log N)$ and depth $O(\sqrt{N} \log N \log\log N)$. □
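Because the parallel oracle acts as a tensor product across processors when the database is in a basis state, the $N$ searches can be simulated independently while sharing the same sequence of rounds. The sketch below makes several assumptions: indices are 0-based, every predicate has exactly one solution, the round count is $\lfloor (\pi/4)\sqrt{N} \rfloor$, and the most likely outcome is returned instead of sampling a measurement. The name `multi_grover` is hypothetical.

```python
import math


def multi_grover(database, predicates):
    """Simulate one Grover search per predicate over a shared database.

    Returns (most likely index for each predicate, number of parallel oracle rounds).
    """
    n = len(database)
    rounds = math.floor(math.pi / 4 * math.sqrt(n))
    amps = [[1 / math.sqrt(n)] * n for _ in predicates]
    for _ in range(rounds):
        # One round = one parallel oracle call followed by a diffusion on every processor.
        for amp, alpha in zip(amps, predicates):
            for j, x in enumerate(database):
                if alpha(x):
                    amp[j] = -amp[j]             # phase flip on solutions of alpha_i
            mean = sum(amp) / n
            amp[:] = [2 * mean - a for a in amp]  # inversion about the mean
    found = [max(range(n), key=lambda j: amp[j] ** 2) for amp in amps]
    return found, rounds
```

The number of rounds depends only on the database size, not on the number of predicates, which is the content of the theorem up to logarithmic factors.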






If $X$ were highly structured (such as being the numbers 1 to $N$), it would be
straightforward to perform $O(N)$ Grover searches in parallel, using a circuit of width $O(N)$,
since we would not need to store $X$ explicitly. Now, making use of the efficient
parallel memory look-up algorithm to access $X$, we are able to interlace the steps in the
Grover algorithm with database look-ups. The end result is that we can indeed
perform $O(N)$ Grover searches in parallel regardless of the structuring of $X$. In the next
subsection we examine the effect of Theorem 6 on existing memory intensive quantum
algorithms.

5.2 Element Distinctness 

In this Section, we present a quantum algorithm for the element distinctness problem:
given a function $f : \{0,1\}^n \to \{0,1\}^n$, determine whether there exist distinct
$i, j \in \{0,1\}^n$ with $f(i) = f(j)$.

The size of the problem is parametrised by $N = 2^n$, and we let $S$ denote the available
memory, measured in $n$-bit words. Suppose that $f$ can be evaluated in time and space $O(1)$.
Previous quantum algorithms for element distinctness require time $T$ satisfying

$$ST^2 = \tilde{O}(N^2). \tag{38}$$

Buhrman et al [10] achieve this for $S$ up to $N^{1/2}$; Ambainis [4] extends this to $S$ up
to $N^{2/3}$.

The Buhrman et al and Ambainis algorithms are for single processor machines. Grover
and Rudolph [17] have pointed out that the notion of a single processor machine with
large memory makes little sense in the quantum world. They argue that requiring space
$S$ is no better than using $S$ processors, and show that for any $S$, the trade-off in Eqn. (38)
can be achieved by simply having each processor apply Grover's search algorithm to a
search space of size $O(N^2/S)$. Grover and Rudolph pose beating this trade-off as a
challenge.

Our algorithm answers this challenge: we achieve the trade-off

$$ST = \tilde{O}(N) \tag{39}$$

for essentially all $S$ up to $N$. It can be thought of in the processor model (like the
Grover-Rudolph algorithm) with $k$ processors each using space $S/k$ and total depth $O(N/S)$,
for some $k$. It is a variant of the Buhrman et al algorithm [10], requiring processors to
access a shared memory of size $O(k)$. Equivalently we can describe it in the circuit model
using calls to $U_{(k,k)}$ with total width $S$ and total depth $O(N/S)$. (Theorem 4 guarantees
that conversion factors between these two models are polylogarithmic.)
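As a worked illustration of the gap between the two trade-offs, take the (arbitrarily chosen) values $N = 2^{20}$ and $S = 2^{10}$ and ignore logarithmic factors:

$$
ST^2 = N^2 \;\Rightarrow\; T = \frac{N}{\sqrt{S}} = \frac{2^{20}}{2^{5}} = 2^{15},
\qquad
ST = N \;\Rightarrow\; T = \frac{N}{S} = \frac{2^{20}}{2^{10}} = 2^{10}.
$$

For this choice of parameters the improved trade-off saves a factor of $\sqrt{S} = 2^{5}$ in time.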

We begin by constructing a database of function evaluations $X = \{f(j) : j = 1, \dots, N\}$,
which takes time $N/k$ since we have $k$ processors. Now consider the following algorithm
that checks if the first $k$ items are marked.

• Sort the numbers $f(1), \dots, f(k)$ and check if any of them are equal. This can be
achieved using the reversible AKS sorting network and so takes time $T = O(\log k)$.

• Using the sorted list, $L$, we construct a function, $g : L \times X \to \{0, 1\}$, that marks
the pairs in which a list entry coincides with a database element. Since $L$ is sorted, $g$
can be computed in time $T = O(1)$.

• Using the multi-Grover algorithm given in Theorem 6, we can search the remaining
$N - k$ database elements in time $T = \tilde{O}\big(\sqrt{(N-k)/k}\big)$.
This algorithm checks if any of the first $k$ indices result in a match, $f(i) = f(j)$ for
$i = 1, \dots, k$ and $j = 1, \dots, N$. It succeeds with probability $k/N$ and can be repeated for
any block of $k$ indices. We can use the algorithm as a Grover oracle so that calling
this oracle $\sqrt{N/k}$ times solves the element distinctness problem. The overall time taken
is

$$\tilde{O}\!\left(\frac{N}{k} + \sqrt{\frac{N}{k}}\,\sqrt{\frac{N-k}{k}}\right) = \tilde{O}\!\left(\frac{N}{k}\right). \tag{41}$$

The above discussion proves the following theorem.

Theorem 7. For any $k < N$, the distinctness of $N$ $n$-bit strings can be decided by
a quantum circuit with width $O(k \cdot \log N)$ and depth $O((N/k) \log k)$, resulting in the
trade-off

$$ST = \tilde{O}(N).$$
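The block structure of the algorithm has a direct classical analogue, sketched below for intuition only. The quantum speed-up comes from replacing both loops over blocks and over the rest of the database by Grover searches; here they are plain scans, membership in the sorted block is tested by binary search, and the function names are hypothetical.

```python
from bisect import bisect_left


def block_has_match(values, start, k):
    """Does some f(i) in the block [start, start + k) equal f(j) for another index j?"""
    block = sorted(values[start:start + k])
    # Step 1: after sorting, equal entries inside the block are adjacent.
    if any(a == b for a, b in zip(block, block[1:])):
        return True

    # Step 2: membership test against the sorted block.
    def g(x):
        pos = bisect_left(block, x)
        return pos < len(block) and block[pos] == x

    # Step 3: scan the rest of the database (a multi-Grover search in the quantum case).
    rest = values[:start] + values[start + k:]
    return any(g(x) for x in rest)


def element_distinct(values, k):
    """True if all values are distinct, checking one block of k indices at a time."""
    return not any(block_has_match(values, s, k) for s in range(0, len(values), k))
```

Only the sorted block of $k$ values needs to be held alongside the element currently being tested, which mirrors the role of $k$ as the space parameter.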

5.3 Collision problem 

Our results also apply to the collision problem, in which $f : [N] \to [N]$ is an efficiently
computable function with the promise of being either 1-1 or 2-1, for which an
$ST^2 = \tilde{O}(N)$ query algorithm is given in [9]. This problem may be solved with

$$ST = \tilde{O}(\sqrt{N})$$

either by selecting $O(\sqrt{N})$ random elements and solving element distinctness, or by
simply using the algorithm of [9] directly, augmented by using $S$ processors with shared
memory together with our look-up algorithm. So we perform $S$ Grover searches in
parallel on spaces of size $N/S^2$ instead of a single search on a space of size $N/S$.

6 Conclusion and discussion



We have presented a new algorithm for accessing quantum memory in parallel. The
algorithm is extremely efficient; it has an overhead that is scarcely larger than any 
algorithm capable of accessing even a single entry from memory. 

The first application of this algorithm is to distributed quantum computing that is con- 
strained to respect the locality of a graph. A variant of the parallel look-up algorithm, 
which we call the data-moving algorithm, provides an efficient way of mapping circuits 
respecting only the complete graph to circuits respecting this limited graph. In The- 
orem 4, we presented a particularly nice situation where the properties of the limited 
graph are balanced with the cost of emulating the circuit model. Each of the processors
contains $O(\log N)$ qubits and has $O(\log N)$ connections to other processors and yet
arbitrary quantum circuits can be emulated with an overhead of $O(\log^2 N)$.

One can think of our data-moving algorithm and Theorem 5 as a proposal for an efficient 
distributed quantum computer. An architecture based on the bitonic sorting network 
(we use the bitonic network since the constants for the asymptotically optimal AKS 
network are too large), would be able to efficiently simulate any algorithm presented in 
the circuit model. 

The idea of using sorting networks as a tool for constructing efficient quantum communi- 
cation protocols opens up many interesting questions and possibilities for future research. 
For example, we note that the parallel look-up map (Definition 2) is an example of a 'pull'
map: each 'instruction' $j_i$ has a location and a value; the location describes the
destination of a data item being transferred (i.e. where it's being pulled to), while the value
describes the source of that data item (i.e. where it's being pulled from). Our algorithm
can be extended to perform analogous 'push' maps, where the roles of destination and 
source are exchanged (see Appendix A for more details). More generally, it provides a 
framework for efficient communication in distributed quantum computing. 

We have demonstrated that the parallel look-up algorithm can be used to optimize 
existing quantum algorithms. We provided an extension of Grover's algorithm that 
efficiently searches over a physical database for multiple solutions, and answered an 
open problem posed by Grover and Rudolph by demonstrating an improved space-time 
trade-off for the Element Distinctness problem. It seems likely that this framework 
for efficient communication in parallel quantum computing will be a useful subroutine 
in other memory-intensive quantum algorithms too, such as triangle finding, or more
generally for frameworks such as learning graphs.

Appendix A: Other Related Algorithms
Push vs Pull 

The data-moving map (Definition 3) and parallel look-up map (Definition 2) are both 
examples of 'pull' maps: each 'instruction' $j_i$ has a location and a value; the location
describes the destination of a data item being transferred (i.e. where it's being pulled
to), while the value describes the source of that data item (i.e. where it's being pulled 
from). For completeness, we also describe analogous 'push' maps, where the roles of 
destination and source are exchanged. 

Definition of parallel-push 

The 'push' analogue of the data-moving map $V_N$ is trivial to define, since it is
nothing other than the inverse of that map, $V_N^{-1}$. However, the 'push' analogue of the
parallel look-up map $U_{(N,N)}$ is much more interesting. Since during a parallel operation it
is possible that more than one data store might want to 'push' data to a given location,
any map for a generic pushing algorithm can only be defined with respect to some
methodology that describes how data should be combined when such 'collisions' occur.
A particularly natural way of doing this would be to employ a computable monoid. A
monoid comes equipped with an associative multiplication operator that allows
two (or more) data items to be combined into one. A nice example of a monoid (in
fact a group) would be the group of length-$d$ bit-strings with the xor operator for group
multiplication. (The reader may observe that we did already use this monoid, albeit in a
relatively benign fashion, at one point in Definition 2 for the parallel look-up algorithm,
for combining target variables, $y$, with memory data, $x$.) But other computable monoids
can equally well be employed in the definition.

As before, we use $N$ index registers ($\lceil \log_2 N \rceil$ bits each), $N$ target registers ($d$ bits each),
and $N$ data registers ($d$ bits each). Letting $\odot$ denote some monoid operator acting on
data items ($d$-bit strings), we define the parallel-push map as follows.

Definition 4. A logical unitary for parallel push with respect to monoid operator $\odot$ is
a map $W_{(N,N,\odot)}$ acting on a series of quantum registers that implements

$$
W_{(N,N,\odot)} : |j_1, \dots, j_N\rangle|y_1, \dots, y_N\rangle|x_1, \dots, x_N\rangle
\mapsto |j_1, \dots, j_N\rangle\Big|y_1 \odot \bigodot_{i : j_i = 1} x_i, \dots, y_N \odot \bigodot_{i : j_i = N} x_i\Big\rangle|x_1, \dots, x_N\rangle. \tag{42}
$$

When the monoid operator is non-commutative, the instruction $\bigodot_{i : j_i = k} x_i$ should be
understood as computing the product of all $x_i$ for which $j_i = k$, with the $x_i$ ordered by
ascending order of the indices $i$, e.g. $x_1 \odot x_4 \odot x_9$.
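On basis states the definition amounts to the following reference semantics, given here as a sketch. Indices are 0-based, `op` is any associative Python function standing in for the monoid operator, and string concatenation is used in the usage example purely as a convenient non-commutative monoid (it is not a fixed-width bit-string operation).

```python
def parallel_push(indices, targets, data, op):
    """Reference semantics of parallel push: target k absorbs every x_i with j_i = k."""
    out = list(targets)
    for i, k in enumerate(indices):   # ascending i fixes the order for non-commutative op
        out[k] = op(out[k], data[i])
    return out
```

With xor, `parallel_push([1, 1, 0, 1], [0, 0, 0, 0], [1, 2, 4, 8], lambda a, b: a ^ b)` combines three colliding pushes into location 1 and a single push into location 0. By associativity, folding the items into the target one at a time gives the same result as first forming the product of the pushed items.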

Parallel push algorithm 

The complexity of the parallel push subroutine, $W_{(N,N,\odot)}$, naturally depends on the
complexity of the individual $\odot$ operation. In all other respects, the complexity is much
the same as for the parallel look-up subroutine, $U_{(N,N)}$.

For brevity, we omit a complete description of our algorithm for $W_{(N,N,\odot)}$ and instead
give an overview of how it differs from the algorithm for $U_{(N,N)}$ given in §3.3.

Initial Formatting prepares two packets per index, as before, but this time taking the
form $(i, 0, y_i, 0)$ and $(j_i, 1, 0, x_i)$. Sorting is the same as before: lexicographic on the
address (first entry) then the flag (second entry). At the end of the sort, we are left with
a sequence of the form

$$\dots\; (i, 0, y_i, 0)\; (i, 1, 0, x_k)\; (i, 1, 0, x_{k'})\; \dots\; (i, 1, 0, x_{k''})\; (i+1, 0, y_{i+1}, 0)\; \dots \tag{43}$$

The total effect of the Cascade will be to load up aux-phase ancillas and to perform
monoid operations on the memory-data registers, so that each accumulates the sum of
those elements to the right that are in the same '$i$-block'. Copying implements a monoid
operation, acting only on the leading packet of each '$i$-block', mapping $(i, 0, y_i, x')$ to
$(i, 0, y_i \odot x', x')$, where $x' = \bigodot_{k : j_k = i} x_k$, before Reversing the Cascade. The Unsort
step naturally matches the earlier Sort step to reverse it. Final Formatting
completes the process by copying data out of the packet pairs
$(i, 0, y_i' = y_i \odot \bigodot_{k : j_k = i} x_k, 0)$, $(j_i, 1, 0, x_i)$.
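The overview above can be traced classically. The sketch below is an illustration under assumptions, not the reversible circuit: indices are 0-based, Python's sort replaces the sorting network, ties among data packets are broken by source index (an added assumption that yields the ascending order required for non-commutative operators), and a sequential right-to-left sweep replaces the logarithmic-depth cascade. The `identity` argument is the monoid's identity element.

```python
def parallel_push_by_sorting(indices, targets, data, op, identity):
    """Sort-based parallel push on basis states, following the packet overview."""
    n = len(targets)
    # Target packets (i, 0, ...) and data packets (j_i, 1, ...); third entry is the source.
    packets = [(i, 0, i) for i in range(n)] + [(indices[i], 1, i) for i in range(n)]
    # Sort on address, then flag, so each target packet leads its own block.
    packets.sort(key=lambda p: (p[0], p[1], p[2]))
    out = list(targets)
    acc = identity
    # Cascade and copy: accumulate each block from the right, then fold into its leader.
    for address, flag, source in reversed(packets):
        if flag == 1:
            acc = op(data[source], acc)
        else:
            out[address] = op(out[address], acc)
            acc = identity
    return out
```

For a block with leader $y_i$ followed by data $x_k, x_{k'}$, the sweep first forms $x_k \odot x_{k'}$ and then $y_i \odot (x_k \odot x_{k'})$, matching the copying step described above.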
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